Content

 1. Introduction
   1.1 Background
   1.2 Research Method
   1.3 Model design
 2. Dataset preparation
   2.1 Dataset
   2.2 Select variable
   2.3 Clean dataset
   2.4 Convert variable
   2.5 Check dataset
 3. Understand the dataset
   3.1 Statistical description
   3.2 Visualization of base information
   3.3 Examine each independent variable and its sampling error
 4. Build multiple linear regression model
 5. Test multiple linear regression model
   5.1 Assumption1: The model is correctly specified
   5.2 Assumption2: The residuals have the same variance and are independent from one another
   5.3 Assumption3: There is no linear relationship between independent variables
   5.4 Assumption4: The independent variables are fixed and measured without error

1. Introduction

1.1 Background

Because schools and daycare services closed during the COVID-19 pandemic, the gender gap in working hours in the United States grew by 20% to 50% between February 2020 and April 2020, and mothers with children reduced their working hours four to five times more than fathers [1]. According to think tanks, the gender gap in paid and unpaid work has narrowed since the mid-1970s, but it is still large [4]. In all developed countries, women (especially mothers) work fewer paid hours than their spouses [4]. Moreover, survey evidence shows that working fewer hours does not give women more leisure time; on the contrary, they have less [2]. Working hours affect work experience, which in turn affects wage levels. According to [3], the correlation between working hours and subsequent promotion is stronger for women than for men. The gap in work hours may therefore be one reason for women’s lower average salary. The gap may arise because the division of labor within the family leaves more of women’s time occupied by unpaid work.

1.2 Research Method

This report addresses the question by using data that is representative of the population of interest to construct a multiple linear regression model. The dependent variable is work hours. The key independent variable is gender. The other independent variables are further factors that influence work hours.

1.3 Model design

This part introduces the choice of variables and how they are measured. There are plenty of potential explanations for work hours. Being male or female may influence the choice of industry, level of education, and other characteristics that also affect work hours. However, this analysis focuses on the direct impact of gender on work hours. It is essential to account for these indirect effects to avoid omitted variable bias. It is also essential to consider other direct influences on work hours, such as age and region. Considering the current dataset and previous theory, the model uses age, region, education, major group, marital status, and industry, in addition to gender, to explain work hours. The model will measure the strength of the impact of each factor on work hours. Work hours are measured as the total actual work time per week of an individual.

2. Dataset preparation

2.1 Dataset

The dataset used in this research is the UK Time Diary Study 2014 - 2015 (tidy data), which focuses on time use. The dataset does not cover the entire population but a sample of it; the analysis includes people who report work hours and are aged 16 or older. The dataset is reported to be representative, so the results can be generalized from the sample to the UK population. One file of the dataset provides summary time-use statistics extracted from the activity schedule and calculated by the Centre for Time Use Research (CTUR) team, so the quality of the actual work hours per week is expected to be good. The unit of analysis is the individual. Each individual has many features recorded as variables, including time use and personal characteristics.

list_origin <- read.delim("uktus15_individual.tab")
dim(list_origin)
## [1] 11421   603

The dataset contains 11421 obs. of 603 variables.

2.2 Select variable

To avoid overcomplicating the analysis with unnecessary data, this research uses simple or grouped classifications of some variables:

  • serial (nominal) : Household number
  • dtotac (ratio) : Total actual work hours in all jobs and businesses per week
  • DMSex (nominal) : Sex of respondent
  • DVAge (ratio) : Age in years
  • dukcntr (nominal) : Country of Usual Residence (UK)
  • dsic (nominal) : SIC 2007 industry divisions (grouped)
  • dmarsta (nominal) : Marital status
  • dhiqual (ordinal) : Highest qualification
  • dsoc (ordinal) : SOC 2010 Major groups
library(dplyr)  # provides select(), filter() and glimpse()
list <- select(list_origin, serial, dtotac, DMSex, DVAge, dukcntr, dsic, dmarsta, dhiqual, dsoc)

2.3 Clean dataset

Filter out unusable and missing values.

Work hours

Restrict work hours to values above 0, so that only economically active people are considered.

nrow(list) 
## [1] 11421
list <- filter(list, dtotac > 0)
nrow(list) 
## [1] 3738

According to the Office for National Statistics, the UK population in 2014 was around 64.6 million and the employed population was around 30 million. Therefore, roughly half of the observations are expected to be lost; they are economically inactive people, such as students, homemakers, job seekers, retired people, and people who are sick. Going from 11421 to 3738 observations, we lose 67% of the sample. Most of the removed observations have a value of -1. According to the documentation, -1 means “item not applicable.” This may include the inactive people mentioned above, people whose working hours cannot be measured accurately (e.g., some consultants), people who cannot separate work from break time (e.g., some content creators), etc. The cleaning process also removes observations with a value of 0 (no working time; they may have been on vacation during the survey), -8 (Don’t know), -7 (Interview not achieved), and -9 (No answer/refused). From this perspective, the remaining dataset is still representative. The reduction is acceptable and has no essential influence on the research.

Then restrict the sample to people who work less than 112 hours per week, since extremely long work hours are very likely to be recording errors.

list <- filter(list, dtotac < 112)
nrow(list) 
## [1] 3736

Age

People are expected to retire at 65, so keep only respondents younger than 65.

list <- filter(list, DVAge < 65)
nrow(list)  
## [1] 3642

Industry/Marital status/Education/Major group

Exclude negative values, which are missing-value codes.

list <- filter(list, dsic > 0)
list <- filter(list, dmarsta > 0)
list <- filter(list, dhiqual > 0)
list <- filter(list, dsoc > 0)
nrow(list)  
## [1] 3585

Only 57 observations are lost (from 3642 to 3585), a difference small enough to ignore. The reason may be that missing values in these variables are closely related to missing values in the dependent variable, work hours, which was handled above. After cleaning, the dataset still has enough observations, and they remain representative.

2.4 Convert variable

Convert the categorical variables into factors so that the numeric codes represent categories. Rename the numeric variables to make them easier to understand.
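The conversion code is not shown in this report. A minimal sketch of the step might look like the following. It assumes that the raw codes run from 1 upward in the same order as the factor levels printed in section 2.5; this mapping is an assumption and should be checked against the data documentation. The toy data frame `raw` is hypothetical.

```r
library(dplyr)

# Hypothetical toy rows standing in for the cleaned survey data
raw <- data.frame(
  dtotac  = c(32, 21, 47),
  DVAge   = c(62, 52, 48),
  DMSex   = c(1, 2, 1),   # assumed coding: 1 = Male, 2 = Female
  dukcntr = c(1, 1, 3)    # assumed coding: 1 = England, ..., 4 = Northern Ireland
)

converted <- raw %>%
  # give the numeric variables readable names
  rename(workHour = dtotac, age = DVAge) %>%
  # turn coded categorical variables into labelled factors
  mutate(
    gender = factor(DMSex, levels = 1:2, labels = c("Male", "Female")),
    region = factor(dukcntr, levels = 1:4,
                    labels = c("England", "Wales", "Scotland", "Northern Ireland"))
  ) %>%
  select(workHour, age, gender, region)

str(converted)
```

The first level of each factor becomes the reference category in the regressions later on, so the order of `labels` matters.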

2.5 Check dataset

glimpse(list)
## Rows: 3,585
## Columns: 9
## $ serial        <int> 11010904, 11010906, 11010906, 11010907, 11010917, 1101…
## $ workHour      <int> 32, 21, 47, 30, 21, 40, 38, 45, 8, 40, 38, 40, 30, 30,…
## $ age           <int> 62, 52, 48, 36, 42, 48, 60, 44, 40, 54, 23, 47, 50, 53…
## $ industry      <fct> "Public admin, education and health", "Other services"…
## $ gender        <fct> Male, Female, Male, Male, Female, Female, Female, Male…
## $ region        <fct> England, England, England, England, England, England, …
## $ maritalStatus <fct> "Married/cohabitating", "Married/cohabitating", "Marri…
## $ education     <fct> Higher education, Secondary, Secondary, Secondary, Sec…
## $ majorGroup    <fct> PProfessionals, Elementary occupations, Skilled Trade,…
levels(list$industry)
## [1] "Agriculture, forestry and fishing"   
## [2] "Manufacturing"                       
## [3] "Energy and water supply"             
## [4] "Construction"                        
## [5] "Distribution, hotels and restaurants"
## [6] "Transport and communication"         
## [7] "Banking and finance"                 
## [8] "Public admin, education and health"  
## [9] "Other services"
levels(list$region)
## [1] "England"          "Wales"            "Scotland"         "Northern Ireland"
levels(list$maritalStatus)
## [1] "Single, never married" "Married/cohabitating"  "Divorced/widowed"
levels(list$majorGroup)
## [1] "Managers"               "PProfessionals"         "Assoc. professionals"  
## [4] "Administrative"         "Skilled Trade"          "Caring & Leisure"      
## [7] "Sales & cust services"  "Machine Operatives"     "Elementary occupations"
levels(list$education)
## [1] "Degree or higher"      "Higher education"      "A level or equivalent"
## [4] "Secondary"             "Other"

3. Understand the dataset

3.1 Statistical description

Central tendency of continuous single variables

Typical value of work hours (ratio variable)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   29.00   38.00   36.48   44.00  110.00

Typical value of age (ratio variable)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   30.00   41.00   40.45   50.00   64.00

Those two variables have no clear outliers.

Central tendency of work hours by group

The difference in work hour between men and women

## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   gender `mean(workHour)`
##   <fct>             <dbl>
## 1 Male               41.0
## 2 Female             32.3

Use cross tabulation to explore the relationship between gender and the other independent variables that indirectly influence work hours (industry, education, and major group, in that order).

##                                       
##                                             Male    Female
##   Agriculture, forestry and fishing    0.7500000 0.2500000
##   Manufacturing                        0.7156627 0.2843373
##   Energy and water supply              0.7285714 0.2714286
##   Construction                         0.8603352 0.1396648
##   Distribution, hotels and restaurants 0.4474474 0.5525526
##   Transport and communication          0.7816092 0.2183908
##   Banking and finance                  0.5027027 0.4972973
##   Public admin, education and health   0.2863777 0.7136223
##   Other services                       0.5255102 0.4744898
##                        
##                              Male    Female
##   Degree or higher      0.4592094 0.5407906
##   Higher education      0.5185185 0.4814815
##   A level or equivalent 0.4309463 0.5690537
##   Secondary             0.5129108 0.4870892
##   Other                 0.5531915 0.4468085
##                         
##                               Male    Female
##   Managers               0.6126984 0.3873016
##   PProfessionals         0.4523507 0.5476493
##   Assoc. professionals   0.5311871 0.4688129
##   Administrative         0.2359813 0.7640187
##   Skilled Trade          0.8941980 0.1058020
##   Caring & Leisure       0.1549708 0.8450292
##   Sales & cust services  0.3411371 0.6588629
##   Machine Operatives     0.8652174 0.1347826
##   Elementary occupations 0.4822335 0.5177665

We can see clear differences between men and women in their proportions across industries and major groups, but little difference in education.
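Each row of the tables above sums to 1, which suggests row proportions. A minimal sketch of how such a table can be produced with base R follows; the toy data are hypothetical, and the real analysis would use the cleaned `list` data frame instead.

```r
# Hypothetical toy data standing in for the cleaned survey data
toy <- data.frame(
  industry = c("Construction", "Construction", "Construction", "Construction",
               "Banking and finance", "Banking and finance"),
  gender   = c("Male", "Male", "Male", "Female", "Male", "Female")
)

# Count each industry-gender combination, then convert counts to row proportions
counts <- table(toy$industry, toy$gender)
shares <- prop.table(counts, margin = 1)  # margin = 1: each row sums to 1

print(shares)
```

Using `margin = 1` answers "what share of each industry is male or female", which is the comparison the text draws.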

3.2 Visualization of base information

Distribution of the dependent variable: work hours

As the dependent variable is continuous, this research uses a histogram to show its distribution.

Very few people work longer than 90 hours per week, so the upper limit of working hours is set to 90 hours before drawing the plot. It can be seen that work hours are close to a normal distribution. Observations cluster around the mean (the main peak). The farther away from the peak, the fewer the observations. That is in line with the common-sense expectation that people who work very long or very short hours are a minority.
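The plotting code is not included in this report. As a rough sketch of the idea, the following uses base R graphics on simulated values; the simulated numbers, the 5-hour bin width, and the base `hist()` function are assumptions made for illustration, not the original plot settings.

```r
set.seed(1)
# Simulated weekly work hours, for illustration only
workHour <- pmin(pmax(round(rnorm(500, mean = 36, sd = 14)), 1), 110)

# Keep observations up to 90 hours, as described above
shown <- workHour[workHour <= 90]

# Histogram with 5-hour bins and a dashed vertical line at the mean
h <- hist(shown, breaks = seq(0, 90, by = 5),
          main = "Distribution of work hours",
          xlab = "Work hours per week")
abline(v = mean(shown), lty = 2)
```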

Distribution of work hour by gender

The plot shows the relationship between gender and work hours. The peak for men lies at slightly longer hours than the peak for women, consistent with the descriptive statistics, in which the average work hours of men are higher. The distributions for men and for women are both skewed, but the overlapping part is nearly normally distributed. Women are clustered at the lower end of the distribution, while long work hours are dominated by men. Starting from the peak, the distribution for women decreases slowly toward shorter hours and steeply toward longer hours. The opposite is true for men. The distribution for women is also flatter. This reflects that, beyond the normal working hours of most people, more women work shorter hours and more men work longer hours.

Distribution of work hour by industry and gender

There is a large variation in work hours between men and women. In every industry, men work longer than women. The largest disparity in work hours between men and women appears in distribution, hotels and restaurants. The mean work hours of men are similar across industries, but those of women vary a lot.

3.3 Examine each independent variable and its sampling error

This section studies in depth how each influencing factor affects work hours, in preparation for building the multiple linear regression model.

3.3.1 Work hour versus gender

## `summarise()` ungrouping output (override with `.groups` argument)
Gender   Mean Work Hour   Var        Std.Dev
Male     40.97442         163.3164   12.77953
Female   32.32708         187.8747   13.70674

Mean work hours differ between the genders. Work hours vary more for women and their distribution is less compressed, which may be because a large number of women do short-hour jobs (their first quartile is very low). This may cause pure heteroscedasticity.
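The `summarise()` message above indicates that the table was built with dplyr. A minimal sketch of such a grouped summary is given below, using hypothetical toy values; the column names `mean_hours`, `var_hours`, and `sd_hours` are illustrative and not taken from the original code.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  gender   = factor(c("Male", "Male", "Male", "Female", "Female", "Female"),
                    levels = c("Male", "Female")),
  workHour = c(40, 45, 38, 20, 35, 41)
)

by_gender <- toy %>%
  group_by(gender) %>%
  summarise(
    mean_hours = mean(workHour),
    var_hours  = var(workHour),
    sd_hours   = sd(workHour),
    .groups = "drop"   # return an ungrouped table and silence the message
  )

print(by_gender)
```

The same pattern applies to the other grouping variables in this section by swapping `gender` for `education`, `industry`, and so on.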

Examine sampling error

  • Null hypothesis:

The average work hours of men and women in the population are equal.

  • Alternative hypothesis:

The average work hours of men and women in the population are not equal.

  • Test statistic

t-value: tests whether the difference in means could be zero in the population. Reject the null hypothesis if the probability (p-value) is equal to or less than 0.05.

## 
##  Welch Two Sample t-test
## 
## data:  workHour by gender
## t = 19.548, df = 3582.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  7.780025 9.514657
## sample estimates:
##   mean in group Male mean in group Female 
##             40.97442             32.32708
  • Level of significance

The t-value is 19.55 and the p-value is below 0.05. If the null hypothesis were true, the probability of observing a t-value as extreme as or more extreme than 19.55 would be less than 2.2e-16. We reject the null hypothesis and conclude that the average work hours of men are statistically significantly different from those of women.
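The output above is a Welch two-sample t-test, which does not assume equal variances in the two groups. A minimal sketch of the call on simulated data follows; the simulated means and standard deviations only loosely mimic the table above and are not the survey data.

```r
set.seed(42)
# Simulated data for illustration only
toy <- data.frame(
  gender   = factor(rep(c("Male", "Female"), each = 200),
                    levels = c("Male", "Female")),
  workHour = c(rnorm(200, mean = 41, sd = 12.8),
               rnorm(200, mean = 32, sd = 13.7))
)

# var.equal = FALSE (the default) requests the Welch test
res <- t.test(workHour ~ gender, data = toy, var.equal = FALSE)

res$statistic   # t-value
res$p.value     # two-sided p-value
res$conf.int    # 95% confidence interval for the difference in means
```

Because the variances of the two groups differ (163.3 versus 187.9 in the table above), the Welch version is a reasonable default here.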

3.3.2 Work hour versus highest qualification

## `summarise()` ungrouping output (override with `.groups` argument)
Highest Qualification   Mean Work Hour   Var        Std.Dev
Degree or higher        38.89571         173.5430   13.17357
Higher education        36.21417         199.2557   14.11580
A level or equivalent   35.15857         198.8737   14.10226
Secondary               35.06103         205.4440   14.33332
Other                   33.07801         187.8010   13.70405

Mean work hours vary with education level. The disparity in work hours among Higher education, A level or equivalent, and Secondary is very small. Work hours vary more for workers with lower education levels, which may be because they start work earlier and take on some short-hour work, making the distribution less compressed, whereas people with higher education levels generally tend to hold stable full-time jobs. The distribution of work hours is compressed to different degrees across education levels, which may cause pure heteroscedasticity.

Examine sampling error

  • Null hypothesis:

The mean work hours are equal across the education groups.

  • Alternative hypothesis:

The mean work hours of at least one education group differ from those of the others.

  • Test statistic

F-value: compares the variation among the groups with the variation within the groups. If the variation among the groups is larger, the grouping is important.

##               Df Sum Sq Mean Sq F value   Pr(>F)    
## education      4  11695  2923.8   15.26 2.25e-12 ***
## Residuals   3580 686153   191.7                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Level of significance

According to the F-value and its p-value, the null hypothesis can be rejected. Education level does explain a statistically significant portion of the variation in work hours.
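The ANOVA table above has the layout produced by `aov()`. The sketch below runs a one-way ANOVA on simulated data and recomputes the F-value by hand as the between-group mean square divided by the within-group mean square; the simulated group means are assumptions for illustration only.

```r
set.seed(7)
# Simulated data for illustration only
toy <- data.frame(
  education = factor(rep(c("Degree or higher", "Secondary", "Other"), each = 100)),
  workHour  = c(rnorm(100, 39, 13), rnorm(100, 35, 14), rnorm(100, 33, 14))
)

fit <- aov(workHour ~ education, data = toy)
tab <- summary(fit)[[1]]   # the ANOVA table
print(tab)

# F = mean square among groups / mean square within groups (residuals)
f_manual <- tab[["Mean Sq"]][1] / tab[["Mean Sq"]][2]
f_manual
```

Applied to the table above, the same ratio is 2923.8 / 191.7, which is roughly 15.3 and agrees with the reported F-value up to rounding.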

3.3.3 Work hour versus industry

## `summarise()` ungrouping output (override with `.groups` argument)
Industry                               Mean Work Hour   Var        Std.Dev
Agriculture, forestry and fishing      42.06250         210.0625   14.49353
Manufacturing                          40.24578         139.9974   11.83205
Energy and water supply                40.77143         121.2224   11.01010
Construction                           42.84358         170.4473   13.05555
Distribution, hotels and restaurants   31.59009         224.6362   14.98787
Transport and communication            41.91954         149.3461   12.22072
Banking and finance                    35.94595         154.0623   12.41218
Public admin, education and health     35.85604         186.8716   13.67010
Other services                         36.66497         198.7138   14.09659

Mean work hours vary with the industry. Average work hours in some industries are similar, such as Manufacturing and Energy and water supply. Work hours vary more in some industries, such as Agriculture, forestry and fishing; Distribution, hotels and restaurants; and Public admin, education and health, since working time in these kinds of jobs is flexible. The distribution of work hours is more compressed in other industries, such as Energy and water supply and Banking and finance, where people have more stable working time. So there may be pure heteroscedasticity.

Examine sampling error

  • Null hypothesis:

The mean work hours are equal across the industry groups.

  • Alternative hypothesis:

The mean work hours of at least one industry group differ from those of the others.

  • Test statistic

F-value

## 
## Call:
## lm(formula = workHour ~ industry, data = list)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.844  -6.920   1.144   7.410  73.335 
## 
## Coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                   42.0625     3.3996  12.373
## industryManufacturing                         -1.8167     3.4646  -0.524
## industryEnergy and water supply               -1.2911     3.7682  -0.343
## industryConstruction                           0.7811     3.5483   0.220
## industryDistribution, hotels and restaurants -10.4724     3.4402  -3.044
## industryTransport and communication           -0.1430     3.5525  -0.040
## industryBanking and finance                   -6.1166     3.5436  -1.726
## industryPublic admin, education and health    -6.2065     3.4206  -1.814
## industryOther services                        -5.3975     3.4456  -1.567
##                                              Pr(>|t|)    
## (Intercept)                                   < 2e-16 ***
## industryManufacturing                         0.60005    
## industryEnergy and water supply               0.73190    
## industryConstruction                          0.82579    
## industryDistribution, hotels and restaurants  0.00235 ** 
## industryTransport and communication           0.96790    
## industryBanking and finance                   0.08442 .  
## industryPublic admin, education and health    0.06970 .  
## industryOther services                        0.11732    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.6 on 3576 degrees of freedom
## Multiple R-squared:  0.0524, Adjusted R-squared:  0.05029 
## F-statistic: 24.72 on 8 and 3576 DF,  p-value: < 2.2e-16
  • Level of significance

According to the F-value and its p-value, the null hypothesis can be rejected. Industry does explain a statistically significant portion of the variation in work hours.

3.3.4 Work hours versus major group

## `summarise()` ungrouping output (override with `.groups` argument)
Major Group              Mean Work Hour   Var        Std.Dev
Managers                 42.52063         190.8109   13.81343
PProfessionals           39.87548         147.5621   12.14751
Assoc. professionals     38.36016         170.5817   13.06069
Administrative           32.77804         132.8569   11.52636
Skilled Trade            41.61775         154.2712   12.42060
Caring & Leisure         32.42398         199.1716   14.11282
Sales & cust services    29.86622         184.9820   13.60081
Machine Operatives       40.55652         163.3833   12.78215
Elementary occupations   28.81980         238.0056   15.42743

Mean work hours vary with the major group. Work hours vary more for workers in some major groups, such as Caring & Leisure and Elementary occupations (which mainly require the use of hand-held tools and often some physical effort). Their mean work hours are relatively short, and people in these groups are not likely to have stable working time. The smallest variation appears in Administrative, since this group often has fixed working hours. This may cause pure heteroscedasticity.

Examine sampling error

  • Null hypothesis:

The mean work hours are equal across the major groups.

  • Alternative hypothesis:

The mean work hours of at least one major group differ from those of the others.

  • Test statistic

F-value

##               Df Sum Sq Mean Sq F value Pr(>F)    
## majorGroup     8  81571   10196   59.16 <2e-16 ***
## Residuals   3576 616277     172                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Level of significance

According to the F-value and its p-value, the null hypothesis can be rejected. Major group does explain a statistically significant portion of the variation in work hours.

3.3.5 Work hours versus region

## `summarise()` ungrouping output (override with `.groups` argument)
Country of UK      Mean Work Hour   Var        Std.Dev
England            36.73779         198.0847   14.07426
Wales              34.31977         145.8562   12.07709
Scotland           35.17692         199.1114   14.11069
Northern Ireland   35.58824         154.9575   12.44819

Mean work hours vary only slightly across the countries of the UK. England has the highest mean (36.7 hours) and Wales the lowest (34.3 hours).

Examine sampling error

  • Null hypothesis:

The mean work hours are equal across the regions.

  • Alternative hypothesis:

The mean work hours of at least one region differ from those of the others.

  • Test statistic

F-value

##               Df Sum Sq Mean Sq F value Pr(>F)  
## region         3   1528   509.3   2.619 0.0492 *
## Residuals   3581 696320   194.4                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Level of significance

According to the F-value and its p-value (0.0492), the null hypothesis can be rejected at the 0.05 level, although only narrowly. Region does explain a statistically significant portion of the variation in work hours. However, according to 1528 / (1528 + 696320) ≈ 0.002, it explains only 0.2% of the variation in work hours. That is not a considerable amount, so the variable is excluded.
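The share of variation explained by region is the between-group sum of squares divided by the total sum of squares (often called eta squared). A short worked calculation with the values from the ANOVA table above is given here; the variable names are illustrative.

```r
# Sums of squares copied from the ANOVA table above
ss_region   <- 1528
ss_residual <- 696320

# Share of the total variation in work hours explained by region
eta_sq <- ss_region / (ss_region + ss_residual)
round(eta_sq, 4)
# 0.0022
```

Note the parentheses around the denominator: without them, R would evaluate `1528 / 1528 + 696320` as `1 + 696320`.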

3.3.6 Work hours versus maritalStatus

## `summarise()` ungrouping output (override with `.groups` argument)
Marital Status          Mean Work Hour   Var        Std.Dev
Single, never married   33.95921         193.1076   13.89632
Married/cohabitating    37.48189         193.5925   13.91375
Divorced/widowed        36.23867         183.5944   13.54970

Mean work hours vary with marital status. The variances of the different marital status groups are similar, so marital status is unlikely to cause pure heteroscedasticity.

Examine sampling error

  • Null hypothesis:

The mean work hours are equal across the marital status groups.

  • Alternative hypothesis:

The mean work hours of at least one marital status group differ from those of the others.

  • Test statistic

F-value

##                 Df Sum Sq Mean Sq F value   Pr(>F)    
## maritalStatus    2   8139    4069   21.13 7.51e-10 ***
## Residuals     3582 689710     193                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Level of significance

According to the F-value and its p-value, the null hypothesis can be rejected. Marital status does explain a statistically significant portion of the variation in work hours.

3.3.7 Work hour versus age

There is no clear pattern in the plot of work hours against age. Some people between 25 and 50 years old are outliers who work long hours. Because this age group is in better physical condition, they are more likely to spend more time at work. Next, explore the relationship between age and work hours with a correlation test.

## 
##  Pearson's product-moment correlation
## 
## data:  list$workHour and list$age
## t = 3.6069, df = 3583, p-value = 0.0003141
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02746640 0.09270246
## sample estimates:
##        cor 
## 0.06014865

The r-value is 0.06, so there is a very weak positive relationship between work hours and age. That is, there is an overall tendency for work hours to increase with age.

Examine sampling error

  • Null hypothesis:

The correlation coefficient for age and work hours in the population is equal to 0.

  • Alternative hypothesis:

The correlation coefficient is not equal to 0.

  • Test statistic

The t-value is 3.6 and the p-value is less than 0.05, so the null hypothesis is rejected.
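The t-value of a Pearson correlation test follows from r and the sample size: t = r * sqrt(n - 2) / sqrt(1 - r^2). With r = 0.0601 and 3583 degrees of freedom this gives about 3.61, matching the output above. The sketch below checks the same relation on simulated data; the simulated slope and noise level are assumptions for illustration only.

```r
set.seed(3)
n <- 300
# Simulated data for illustration only
age      <- sample(16:64, n, replace = TRUE)
workHour <- 33 + 0.07 * age + rnorm(n, sd = 14)

res <- cor.test(workHour, age)   # Pearson correlation by default
res$estimate                     # r
res$p.value

# Recompute the t-value from r and n
r <- unname(res$estimate)
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_manual
```

A statistically significant but very small r, as found here, mostly reflects the large sample size rather than a strong relationship.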

4. Build multiple linear regression model

Use the specific-to-general method (start with what we believe is important and add to it) to construct the model.
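One way to organize this stepwise construction is to fit the smallest model first and extend it with `update()`, comparing the adjusted R-squared at each step. The sketch below uses simulated data with made-up effect sizes; it illustrates the workflow and is not the code behind the outputs in this section.

```r
set.seed(11)
n <- 400
# Simulated data for illustration only
toy <- data.frame(
  gender    = factor(sample(c("Male", "Female"), n, replace = TRUE),
                     levels = c("Male", "Female")),
  education = factor(sample(c("Degree or higher", "Secondary", "Other"),
                            n, replace = TRUE))
)
toy$workHour <- 41 - 8 * (toy$gender == "Female") -
  3 * (toy$education != "Degree or higher") + rnorm(n, sd = 13)

# Step 1: key independent variable only
m1 <- lm(workHour ~ gender, data = toy)

# Step 2: add a control variable to the existing formula
m2 <- update(m1, . ~ . + education)

# Compare explained variation and test the added terms
c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)
anova(m1, m2)
```

Watching how the `genderFemale` coefficient moves between steps shows how much of the raw gap is associated with each added control.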

4.1 Gender

Begin with the dependent variable and the key independent variable.

## 
## Call:
## lm(formula = workHour ~ gender, data = list)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.974  -7.327  -0.974   7.673  69.673 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   40.9744     0.3200  128.06   <2e-16 ***
## genderFemale  -8.6473     0.4436  -19.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.27 on 3583 degrees of freedom
## Multiple R-squared:  0.09588,    Adjusted R-squared:  0.09563 
## F-statistic:   380 on 1 and 3583 DF,  p-value: < 2.2e-16

Women work around 8.6 hours per week less than men. The impact is statistically significant. Gender accounts for about 9.6% of the variation in work hours.

4.2 Add education group

## 
## Call:
## lm(formula = workHour ~ gender + education, data = list)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.653  -6.653   0.347   6.835  70.347 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     43.6532     0.4492  97.188  < 2e-16 ***
## genderFemale                    -8.7974     0.4403 -19.982  < 2e-16 ***
## educationHigher education       -3.2033     0.6508  -4.922 8.93e-07 ***
## educationA level or equivalent  -3.4885     0.6048  -5.768 8.70e-09 ***
## educationSecondary              -4.3071     0.5900  -7.301 3.51e-13 ***
## educationOther                  -6.6445     1.1705  -5.677 1.48e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.13 on 3579 degrees of freedom
## Multiple R-squared:  0.1154, Adjusted R-squared:  0.1142 
## F-statistic: 93.42 on 5 and 3579 DF,  p-value: < 2.2e-16

People with higher education qualifications tend to work longer. Controlling for education level, men work 8.8 hours longer than women per week. The work hour gap between men and women is larger than before, possibly because women have relatively higher education qualifications. The model now explains about 11.4% of the variation in work hours (adjusted R-squared).

4.3 Add major group

## 
## Call:
## lm(formula = workHour ~ gender + education + majorGroup, data = list)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.171  -6.887   0.065   6.840  75.065 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       46.1712     0.8023  57.549  < 2e-16 ***
## genderFemale                      -7.3958     0.4684 -15.790  < 2e-16 ***
## educationHigher education         -1.5610     0.6591  -2.368  0.01792 *  
## educationA level or equivalent    -0.9012     0.6529  -1.380  0.16761    
## educationSecondary                -1.3175     0.6607  -1.994  0.04622 *  
## educationOther                    -2.5733     1.2077  -2.131  0.03318 *  
## majorGroupPProfessionals          -1.8884     0.8690  -2.173  0.02985 *  
## majorGroupAssoc. professionals    -3.6156     0.9153  -3.950 7.96e-05 ***
## majorGroupAdministrative          -6.7844     0.9632  -7.043 2.24e-12 ***
## majorGroupSkilled Trade           -2.5637     1.0551  -2.430  0.01516 *  
## majorGroupCaring & Leisure        -6.4316     1.0261  -6.268 4.10e-10 ***
## majorGroupSales & cust services  -10.3352     1.0424  -9.915  < 2e-16 ***
## majorGroupMachine Operatives      -3.2618     1.1351  -2.874  0.00408 ** 
## majorGroupElementary occupations -12.2516     0.9840 -12.451  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.69 on 3571 degrees of freedom
## Multiple R-squared:  0.1761, Adjusted R-squared:  0.1731 
## F-statistic: 58.72 on 13 and 3571 DF,  p-value: < 2.2e-16

Compared to Managers, some major groups differ only at the 5% significance level, such as Professionals and Skilled Trade. Other major groups are highly significant, such as Administrative, Sales & customer services, and Elementary occupations; people in those groups work less. The work hour gap between the genders becomes smaller (7.4 hours), which suggests that women are more likely to work in those major groups. The model now accounts for about 17.3% of the variation in work hours. However, controlling for major group makes the education coefficients smaller and less significant (A level or equivalent is no longer significant), because education and major group are related, as the following cross tabulation shows.

##                         
##                          Degree or higher Higher education
##   Managers                    0.406349206      0.184126984
##   PProfessionals              0.735705210      0.151207116
##   Assoc. professionals        0.430583501      0.205231388
##   Administrative              0.235981308      0.184579439
##   Skilled Trade               0.071672355      0.208191126
##   Caring & Leisure            0.146198830      0.201754386
##   Sales & cust services       0.160535117      0.167224080
##   Machine Operatives          0.056521739      0.126086957
##   Elementary occupations      0.088832487      0.137055838
##                         
##                          A level or equivalent   Secondary       Other
##   Managers                         0.174603175 0.209523810 0.025396825
##   PProfessionals                   0.074968234 0.035578145 0.002541296
##   Assoc. professionals             0.191146881 0.167002012 0.006036217
##   Administrative                   0.273364486 0.289719626 0.016355140
##   Skilled Trade                    0.303754266 0.368600683 0.047781570
##   Caring & Leisure                 0.400584795 0.204678363 0.046783626
##   Sales & cust services            0.270903010 0.351170569 0.050167224
##   Machine Operatives               0.195652174 0.491304348 0.130434783
##   Elementary occupations           0.263959391 0.393401015 0.116751269

4.4 Add marital status

## 
## Call:
## lm(formula = workHour ~ gender + education + majorGroup + maritalStatus, 
##     data = list)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.607  -6.855  -0.152   6.936  74.408 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        44.7260     0.9187  48.681  < 2e-16 ***
## genderFemale                       -7.4319     0.4692 -15.841  < 2e-16 ***
## educationHigher education          -1.6636     0.6590  -2.525 0.011627 *  
## educationA level or equivalent     -0.8479     0.6523  -1.300 0.193751    
## educationSecondary                 -1.4928     0.6612  -2.258 0.024019 *  
## educationOther                     -3.0138     1.2117  -2.487 0.012914 *  
## majorGroupPProfessionals           -1.6863     0.8693  -1.940 0.052482 .  
## majorGroupAssoc. professionals     -3.4151     0.9160  -3.728 0.000196 ***
## majorGroupAdministrative           -6.6269     0.9631  -6.881 7.00e-12 ***
## majorGroupSkilled Trade            -2.2315     1.0575  -2.110 0.034919 *  
## majorGroupCaring & Leisure         -6.2047     1.0274  -6.039 1.71e-09 ***
## majorGroupSales & cust services    -9.8310     1.0520  -9.345  < 2e-16 ***
## majorGroupMachine Operatives       -3.0308     1.1355  -2.669 0.007637 ** 
## majorGroupElementary occupations  -11.7136     0.9975 -11.743  < 2e-16 ***
## maritalStatusMarried/cohabitating   1.5446     0.5094   3.032 0.002444 ** 
## maritalStatusDivorced/widowed       2.8402     0.8250   3.443 0.000583 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.67 on 3569 degrees of freedom
## Multiple R-squared:  0.1795, Adjusted R-squared:  0.176 
## F-statistic: 52.04 on 15 and 3569 DF,  p-value: < 2.2e-16

The work hour gap between genders does not change much, and neither do the coefficients of the other independent variables already in the model, which suggests that marital status has no clear relationship with those variables. Marital status explains a statistically significant proportion of the variation in work hours. Holding the other variables constant, married/cohabitating people tend to work longer (about 1.5 hours) than single, never-married people, and divorced/widowed people tend to work longer still (about 2.8 hours). The model now accounts for about 17.6% of the variation in work hours.
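The comparison with single people works because R treats the first level of a factor as the reference category. The sketch below illustrates this on simulated data; it is not the report's code, and the data frame `toy`, its values and the level order are assumptions made for illustration. In a model with a single factor, each coefficient equals the difference between that level's mean and the reference level's mean.

```r
# Simulated data (hypothetical values, not the survey data)
set.seed(1)
status_levels <- c("Single", "Married/cohabitating", "Divorced/widowed")
toy <- data.frame(
  maritalStatus = factor(rep(status_levels, each = 50), levels = status_levels)
)
# Group-specific mean hours plus noise
true_means <- c(36, 37.5, 39)
toy$workHour <- true_means[as.integer(toy$maritalStatus)] + rnorm(150, sd = 5)

fit <- lm(workHour ~ maritalStatus, data = toy)

# The first factor level is the reference category absorbed by the intercept
ref_level <- levels(toy$maritalStatus)[1]

# Each dummy coefficient equals (level mean - reference mean)
group_means <- sapply(split(toy$workHour, toy$maritalStatus), mean)
mean_diffs <- group_means[-1] - group_means[1]

print(coef(fit))
print(mean_diffs)
```

With several predictors in the model, as in the report, the coefficients are adjusted differences rather than raw differences in means, but they are still measured against the reference level.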

4.5 Add industry

## 
## Call:
## lm(formula = workHour ~ gender + education + majorGroup + maritalStatus + 
##     industry, data = list)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.802  -7.075   0.041   6.757  73.976 
## 
## Coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                   50.4996     3.2914  15.343
## genderFemale                                  -6.8913     0.4798 -14.363
## educationHigher education                     -1.9022     0.6573  -2.894
## educationA level or equivalent                -0.9971     0.6501  -1.534
## educationSecondary                            -1.5560     0.6623  -2.350
## educationOther                                -3.1387     1.2071  -2.600
## majorGroupPProfessionals                      -1.5763     0.9017  -1.748
## majorGroupAssoc. professionals                -3.1774     0.9303  -3.415
## majorGroupAdministrative                      -6.5885     0.9691  -6.799
## majorGroupSkilled Trade                       -2.8800     1.0732  -2.684
## majorGroupCaring & Leisure                    -6.1448     1.0800  -5.690
## majorGroupSales & cust services               -8.5336     1.0893  -7.834
## majorGroupMachine Operatives                  -4.9554     1.1660  -4.250
## majorGroupElementary occupations             -11.6775     0.9979 -11.702
## maritalStatusMarried/cohabitating              1.2738     0.5102   2.497
## maritalStatusDivorced/widowed                  2.5569     0.8224   3.109
## industryManufacturing                         -3.9260     3.2190  -1.220
## industryEnergy and water supply               -3.6937     3.5007  -1.055
## industryConstruction                          -2.8210     3.2932  -0.857
## industryDistribution, hotels and restaurants  -7.6683     3.1998  -2.396
## industryTransport and communication           -0.9325     3.3086  -0.282
## industryBanking and finance                   -7.7458     3.3109  -2.339
## industryPublic admin, education and health    -6.1151     3.2035  -1.909
## industryOther services                        -6.2187     3.2088  -1.938
##                                              Pr(>|t|)    
## (Intercept)                                   < 2e-16 ***
## genderFemale                                  < 2e-16 ***
## educationHigher education                    0.003826 ** 
## educationA level or equivalent               0.125199    
## educationSecondary                           0.018852 *  
## educationOther                               0.009357 ** 
## majorGroupPProfessionals                     0.080540 .  
## majorGroupAssoc. professionals               0.000644 ***
## majorGroupAdministrative                     1.23e-11 ***
## majorGroupSkilled Trade                      0.007318 ** 
## majorGroupCaring & Leisure                   1.38e-08 ***
## majorGroupSales & cust services              6.20e-15 ***
## majorGroupMachine Operatives                 2.19e-05 ***
## majorGroupElementary occupations              < 2e-16 ***
## maritalStatusMarried/cohabitating            0.012576 *  
## maritalStatusDivorced/widowed                0.001891 ** 
## industryManufacturing                        0.222681    
## industryEnergy and water supply              0.291431    
## industryConstruction                         0.391723    
## industryDistribution, hotels and restaurants 0.016605 *  
## industryTransport and communication          0.778071    
## industryBanking and finance                  0.019366 *  
## industryPublic admin, education and health   0.056358 .  
## industryOther services                       0.052699 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.58 on 3561 degrees of freedom
## Multiple R-squared:  0.1924, Adjusted R-squared:  0.1872 
## F-statistic: 36.88 on 23 and 3561 DF,  p-value: < 2.2e-16

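Most of the industry groups are not statistically significant; only "Distribution, hotels and restaurants" and "Banking and finance" are significant at the 5% level. The model now accounts for about 18.7% of the variation in work hours.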

4.6 Add age

## 
## Call:
## lm(formula = workHour ~ gender + education + majorGroup + maritalStatus + 
##     industry + age, data = list)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.918  -6.993   0.099   6.781  73.938 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                   50.98722    3.37154  15.123
## genderFemale                                  -6.89520    0.47987 -14.369
## educationHigher education                     -1.86747    0.65939  -2.832
## educationA level or equivalent                -0.99396    0.65018  -1.529
## educationSecondary                            -1.49577    0.66844  -2.238
## educationOther                                -2.98998    1.22756  -2.436
## majorGroupPProfessionals                      -1.59984    0.90248  -1.773
## majorGroupAssoc. professionals                -3.22002    0.93258  -3.453
## majorGroupAdministrative                      -6.60882    0.96963  -6.816
## majorGroupSkilled Trade                       -2.90559    1.07397  -2.705
## majorGroupCaring & Leisure                    -6.18565    1.08180  -5.718
## majorGroupSales & cust services               -8.58124    1.09176  -7.860
## majorGroupMachine Operatives                  -4.97624    1.16651  -4.266
## majorGroupElementary occupations             -11.73339    1.00151 -11.716
## maritalStatusMarried/cohabitating              1.45510    0.57786   2.518
## maritalStatusDivorced/widowed                  2.80421    0.90187   3.109
## industryManufacturing                         -3.97959    3.22022  -1.236
## industryEnergy and water supply               -3.77627    3.50311  -1.078
## industryConstruction                          -2.88834    3.29503  -0.877
## industryDistribution, hotels and restaurants  -7.76621    3.20342  -2.424
## industryTransport and communication           -0.96987    3.30933  -0.293
## industryBanking and finance                   -7.81805    3.31294  -2.360
## industryPublic admin, education and health    -6.13916    3.20398  -1.916
## industryOther services                        -6.28213    3.21044  -1.957
## age                                           -0.01407    0.02106  -0.668
##                                              Pr(>|t|)    
## (Intercept)                                   < 2e-16 ***
## genderFemale                                  < 2e-16 ***
## educationHigher education                    0.004650 ** 
## educationA level or equivalent               0.126413    
## educationSecondary                           0.025302 *  
## educationOther                               0.014911 *  
## majorGroupPProfessionals                     0.076363 .  
## majorGroupAssoc. professionals               0.000561 ***
## majorGroupAdministrative                     1.10e-11 ***
## majorGroupSkilled Trade                      0.006853 ** 
## majorGroupCaring & Leisure                   1.17e-08 ***
## majorGroupSales & cust services              5.05e-15 ***
## majorGroupMachine Operatives                 2.04e-05 ***
## majorGroupElementary occupations              < 2e-16 ***
## maritalStatusMarried/cohabitating            0.011844 *  
## maritalStatusDivorced/widowed                0.001890 ** 
## industryManufacturing                        0.216610    
## industryEnergy and water supply              0.281117    
## industryConstruction                         0.380776    
## industryDistribution, hotels and restaurants 0.015385 *  
## industryTransport and communication          0.769484    
## industryBanking and finance                  0.018336 *  
## industryPublic admin, education and health   0.055432 .  
## industryOther services                       0.050451 .  
## age                                          0.504079    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.58 on 3560 degrees of freedom
## Multiple R-squared:  0.1925, Adjusted R-squared:  0.187 
## F-statistic: 35.36 on 24 and 3560 DF,  p-value: < 2.2e-16

Age is not statistically significant for the variation in work hours (p-value = 0.504). According to theory, however, the two are related, so other forms of the relationship are explored in Section 5.1.3. Since it has not yet been checked whether the residuals follow a normal distribution, the p-values should not be fully trusted. So, keep the variable.

4.7 Add region

## 
## Call:
## lm(formula = workHour ~ gender + education + majorGroup + maritalStatus + 
##     industry + age + region, data = list)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.024  -6.946   0.139   6.897  73.820 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                   50.97079    3.37050  15.123
## genderFemale                                  -6.91061    0.47976 -14.404
## educationHigher education                     -1.84055    0.65926  -2.792
## educationA level or equivalent                -1.03549    0.65039  -1.592
## educationSecondary                            -1.44209    0.66850  -2.157
## educationOther                                -2.88494    1.22815  -2.349
## majorGroupPProfessionals                      -1.55861    0.90251  -1.727
## majorGroupAssoc. professionals                -3.19694    0.93235  -3.429
## majorGroupAdministrative                      -6.53344    0.97037  -6.733
## majorGroupSkilled Trade                       -2.84303    1.07510  -2.644
## majorGroupCaring & Leisure                    -6.11441    1.08206  -5.651
## majorGroupSales & cust services               -8.53116    1.09197  -7.813
## majorGroupMachine Operatives                  -4.90831    1.16749  -4.204
## majorGroupElementary occupations             -11.74397    1.00161 -11.725
## maritalStatusMarried/cohabitating              1.47824    0.57780   2.558
## maritalStatusDivorced/widowed                  2.82626    0.90171   3.134
## industryManufacturing                         -3.89130    3.21939  -1.209
## industryEnergy and water supply               -3.73936    3.50192  -1.068
## industryConstruction                          -2.79855    3.29374  -0.850
## industryDistribution, hotels and restaurants  -7.61684    3.20283  -2.378
## industryTransport and communication           -0.94532    3.30851  -0.286
## industryBanking and finance                   -7.72552    3.31160  -2.333
## industryPublic admin, education and health    -5.98614    3.20321  -1.869
## industryOther services                        -6.16173    3.20950  -1.920
## age                                           -0.01348    0.02105  -0.640
## regionWales                                   -2.17378    0.99011  -2.195
## regionScotland                                -1.13024    0.81734  -1.383
## regionNorthern Ireland                        -0.26073    1.27304  -0.205
##                                              Pr(>|t|)    
## (Intercept)                                   < 2e-16 ***
## genderFemale                                  < 2e-16 ***
## educationHigher education                    0.005269 ** 
## educationA level or equivalent               0.111451    
## educationSecondary                           0.031057 *  
## educationOther                               0.018877 *  
## majorGroupPProfessionals                     0.084259 .  
## majorGroupAssoc. professionals               0.000613 ***
## majorGroupAdministrative                     1.93e-11 ***
## majorGroupSkilled Trade                      0.008219 ** 
## majorGroupCaring & Leisure                   1.72e-08 ***
## majorGroupSales & cust services              7.32e-15 ***
## majorGroupMachine Operatives                 2.68e-05 ***
## majorGroupElementary occupations              < 2e-16 ***
## maritalStatusMarried/cohabitating            0.010557 *  
## maritalStatusDivorced/widowed                0.001736 ** 
## industryManufacturing                        0.226855    
## industryEnergy and water supply              0.285683    
## industryConstruction                         0.395573    
## industryDistribution, hotels and restaurants 0.017452 *  
## industryTransport and communication          0.775105    
## industryBanking and finance                  0.019711 *  
## industryPublic admin, education and health   0.061733 .  
## industryOther services                       0.054958 .  
## age                                          0.521983    
## regionWales                                  0.028193 *  
## regionScotland                               0.166804    
## regionNorthern Ireland                       0.837734    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.58 on 3557 degrees of freedom
## Multiple R-squared:  0.1939, Adjusted R-squared:  0.1878 
## F-statistic:  31.7 on 27 and 3557 DF,  p-value: < 2.2e-16

Region is largely not statistically significant: only Wales differs from the reference region at the 5% level (p-value = 0.028). Since it has not yet been checked whether the residuals follow a normal distribution, the p-values should not be fully trusted. So, keep the variable. The model now accounts for about 18.8% of the variation in work hours.
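The t-tests in the summary compare each region with the reference region one at a time. Whether a factor matters as a whole is usually judged with a joint F-test of the model with and without it. The sketch below shows one way to do this with `anova()` on simulated data; the data frame `toy`, the choice of "England" as reference level and all numeric values are assumptions for illustration, not the report's data.

```r
# Simulated data (hypothetical, not the survey data)
set.seed(2)
n <- 400
region_levels <- c("England", "Wales", "Scotland", "Northern Ireland")
toy <- data.frame(
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE),
                  levels = c("Male", "Female")),
  region = factor(sample(region_levels, n, replace = TRUE),
                  levels = region_levels)
)
# Hours depend on gender only; region has no true effect in this simulation
toy$workHour <- 40 - 7 * (toy$gender == "Female") + rnorm(n, sd = 12)

fit_small <- lm(workHour ~ gender, data = toy)
fit_large <- lm(workHour ~ gender + region, data = toy)

# Partial F-test: H0 is that all region coefficients are zero
joint_test <- anova(fit_small, fit_large)
print(joint_test)

# Degrees of freedom of the test = number of added dummy variables
added_df <- joint_test$Df[2]
p_value <- joint_test[["Pr(>F)"]][2]
```

A single significant level among several, as with Wales here, can arise by chance, which is why the joint test is a useful complement to the individual p-values.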

5 Test multiple linear regression model

Assume the data are representative of the population. The four assumptions discussed in Peter Kennedy's "A Guide to Econometrics" are then examined in turn to refine the model.

5.1 Assumption1: The model is correctly specified

5.1.1 Checking model for omitted variable bias

The focus is on the effect of gender on working hours, that is, on whether two people who are the same in all other characteristics except gender have the same work hours. This requires controlling for all variables that determine work hours and vary by sex. This research considers industry, education and major group. The set of variables is not complete: other explanations are not included in the model (e.g., self-employed versus employee). So there is still omitted variable bias.

5.1.2 Checking model for outliers

Some outliers have a large impact on the parameters, which can make the model incorrectly specified. The data have already been cleaned, so too many outliers are not expected. Now use the Residuals vs Leverage plot of the main regression model to figure out how influential each point is.

The plot shows that some outliers appear with large residuals, but none of them is highly influential. So they are not removed.
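The quantities behind a Residuals vs Leverage plot can also be inspected numerically. The following sketch uses simulated data with one artificial outlier; the data frame `toy` and the cut-offs of 3 for standardized residuals and 0.5 for Cook's distance are common conventions assumed here, not values taken from the report.

```r
# Simulated data (hypothetical, not the survey data)
set.seed(3)
n <- 200
toy <- data.frame(age = runif(n, 18, 65))
toy$workHour <- 38 + 0.05 * toy$age + rnorm(n, sd = 8)
# Add one observation with a large residual but a typical age (low leverage)
toy <- rbind(toy, data.frame(age = 40, workHour = 100))

fit <- lm(workHour ~ age, data = toy)

diagnostics <- data.frame(
  std_resid = rstandard(fit),      # standardized residuals
  leverage  = hatvalues(fit),      # diagonal of the hat matrix
  cooks_d   = cooks.distance(fit)  # combines residual size and leverage
)

# Rules of thumb (conventions, not hard thresholds)
large_resid <- abs(diagnostics$std_resid) > 3
influential <- diagnostics$cooks_d > 0.5

print(diagnostics[large_resid, ])
print(sum(influential))

# plot(fit, which = 5) draws the Residuals vs Leverage plot itself
```

The artificial point has a large residual but low leverage, so its Cook's distance stays small. This is the same situation the report describes: outliers that are visible but not highly influential.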

5.1.3 Checking model for functional form

The only ratio-scale independent variable in the model is age. The model shows that there is no linear pattern between age and work hours. According to the visualization part, many long work hour observations occur among people who are 25-50 years old. Consider replacing age with |age - A|, which means that working hours peak at age A and decrease as the distance from A increases, for people both older and younger than A. The refined model below uses A = 40.
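The report's refined model uses `abs(age - 40)`; how the peak age was chosen is not stated. One possible approach, shown here purely as an assumption and not as the author's method, is a grid search over candidate values of A that keeps the value with the highest adjusted R-squared. The data are simulated and the names `toy`, `candidates` and `best_A` are hypothetical.

```r
# Simulated data (hypothetical, not the survey data)
set.seed(4)
n <- 500
toy <- data.frame(age = round(runif(n, 18, 65)))
# Simulated hours peak at age 40 and fall off linearly on both sides
toy$workHour <- 45 - 0.5 * abs(toy$age - 40) + rnorm(n, sd = 2)

# Fit the model for each candidate peak age A and record adjusted R-squared
candidates <- 25:55
adj_r2 <- sapply(candidates, function(A) {
  fit <- lm(workHour ~ abs(age - A), data = toy)
  summary(fit)$adj.r.squared
})

best_A <- candidates[which.max(adj_r2)]
print(best_A)

# A plain linear age term misses the tent shape
linear_r2 <- summary(lm(workHour ~ age, data = toy))$adj.r.squared
print(c(linear = linear_r2, tent = max(adj_r2)))
```

Because A is selected using the same data, the reported p-value of the transformed term is somewhat optimistic; a quadratic term in age would be an alternative that avoids choosing a peak by hand.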

Evaluate the refined model:

R-squared

## 
## Call:
## lm(formula = workHour ~ gender + education + majorGroup + maritalStatus + 
##     industry + age + region, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.024  -6.946   0.139   6.897  73.820 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                   50.97079    3.37050  15.123
## genderFemale                                  -6.91061    0.47976 -14.404
## educationHigher education                     -1.84055    0.65926  -2.792
## educationA level or equivalent                -1.03549    0.65039  -1.592
## educationSecondary                            -1.44209    0.66850  -2.157
## educationOther                                -2.88494    1.22815  -2.349
## majorGroupPProfessionals                      -1.55861    0.90251  -1.727
## majorGroupAssoc. professionals                -3.19694    0.93235  -3.429
## majorGroupAdministrative                      -6.53344    0.97037  -6.733
## majorGroupSkilled Trade                       -2.84303    1.07510  -2.644
## majorGroupCaring & Leisure                    -6.11441    1.08206  -5.651
## majorGroupSales & cust services               -8.53116    1.09197  -7.813
## majorGroupMachine Operatives                  -4.90831    1.16749  -4.204
## majorGroupElementary occupations             -11.74397    1.00161 -11.725
## maritalStatusMarried/cohabitating              1.47824    0.57780   2.558
## maritalStatusDivorced/widowed                  2.82626    0.90171   3.134
## industryManufacturing                         -3.89130    3.21939  -1.209
## industryEnergy and water supply               -3.73936    3.50192  -1.068
## industryConstruction                          -2.79855    3.29374  -0.850
## industryDistribution, hotels and restaurants  -7.61684    3.20283  -2.378
## industryTransport and communication           -0.94532    3.30851  -0.286
## industryBanking and finance                   -7.72552    3.31160  -2.333
## industryPublic admin, education and health    -5.98614    3.20321  -1.869
## industryOther services                        -6.16173    3.20950  -1.920
## age                                           -0.01348    0.02105  -0.640
## regionWales                                   -2.17378    0.99011  -2.195
## regionScotland                                -1.13024    0.81734  -1.383
## regionNorthern Ireland                        -0.26073    1.27304  -0.205
##                                              Pr(>|t|)    
## (Intercept)                                   < 2e-16 ***
## genderFemale                                  < 2e-16 ***
## educationHigher education                    0.005269 ** 
## educationA level or equivalent               0.111451    
## educationSecondary                           0.031057 *  
## educationOther                               0.018877 *  
## majorGroupPProfessionals                     0.084259 .  
## majorGroupAssoc. professionals               0.000613 ***
## majorGroupAdministrative                     1.93e-11 ***
## majorGroupSkilled Trade                      0.008219 ** 
## majorGroupCaring & Leisure                   1.72e-08 ***
## majorGroupSales & cust services              7.32e-15 ***
## majorGroupMachine Operatives                 2.68e-05 ***
## majorGroupElementary occupations              < 2e-16 ***
## maritalStatusMarried/cohabitating            0.010557 *  
## maritalStatusDivorced/widowed                0.001736 ** 
## industryManufacturing                        0.226855    
## industryEnergy and water supply              0.285683    
## industryConstruction                         0.395573    
## industryDistribution, hotels and restaurants 0.017452 *  
## industryTransport and communication          0.775105    
## industryBanking and finance                  0.019711 *  
## industryPublic admin, education and health   0.061733 .  
## industryOther services                       0.054958 .  
## age                                          0.521983    
## regionWales                                  0.028193 *  
## regionScotland                               0.166804    
## regionNorthern Ireland                       0.837734    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.58 on 3557 degrees of freedom
## Multiple R-squared:  0.1939, Adjusted R-squared:  0.1878 
## F-statistic:  31.7 on 27 and 3557 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = workHour ~ gender + education + majorGroup + maritalStatus + 
##     industry + abs(age - 40) + region, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.502  -6.967   0.055   7.066  73.419 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                   53.11528    3.31134  16.040
## genderFemale                                  -6.97885    0.47792 -14.603
## educationHigher education                     -1.74667    0.65490  -2.667
## educationA level or equivalent                -0.82799    0.64882  -1.276
## educationSecondary                            -1.29787    0.66070  -1.964
## educationOther                                -2.47710    1.20704  -2.052
## majorGroupPProfessionals                      -1.57878    0.89811  -1.758
## majorGroupAssoc. professionals                -3.23389    0.92643  -3.491
## majorGroupAdministrative                      -6.43647    0.96598  -6.663
## majorGroupSkilled Trade                       -2.80510    1.06999  -2.622
## majorGroupCaring & Leisure                    -5.97347    1.07601  -5.551
## majorGroupSales & cust services               -8.34713    1.08543  -7.690
## majorGroupMachine Operatives                  -5.03466    1.16255  -4.331
## majorGroupElementary occupations             -11.54635    0.99435 -11.612
## maritalStatusMarried/cohabitating              0.57138    0.52546   1.087
## maritalStatusDivorced/widowed                  2.02281    0.82553   2.450
## industryManufacturing                         -4.05758    3.20529  -1.266
## industryEnergy and water supply               -3.77256    3.48529  -1.082
## industryConstruction                          -3.00955    3.27894  -0.918
## industryDistribution, hotels and restaurants  -7.61728    3.18623  -2.391
## industryTransport and communication           -1.24244    3.29489  -0.377
## industryBanking and finance                   -7.87953    3.29637  -2.390
## industryPublic admin, education and health    -6.17737    3.18998  -1.936
## industryOther services                        -6.22014    3.19487  -1.947
## abs(age - 40)                                 -0.19006    0.03467  -5.482
## regionWales                                   -2.05077    0.98624  -2.079
## regionScotland                                -1.14569    0.81395  -1.408
## regionNorthern Ireland                        -0.43240    1.26814  -0.341
##                                              Pr(>|t|)    
## (Intercept)                                   < 2e-16 ***
## genderFemale                                  < 2e-16 ***
## educationHigher education                    0.007686 ** 
## educationA level or equivalent               0.201986    
## educationSecondary                           0.049564 *  
## educationOther                               0.040223 *  
## majorGroupPProfessionals                     0.078852 .  
## majorGroupAssoc. professionals               0.000488 ***
## majorGroupAdministrative                     3.09e-11 ***
## majorGroupSkilled Trade                      0.008789 ** 
## majorGroupCaring & Leisure                   3.04e-08 ***
## majorGroupSales & cust services              1.89e-14 ***
## majorGroupMachine Operatives                 1.53e-05 ***
## majorGroupElementary occupations              < 2e-16 ***
## maritalStatusMarried/cohabitating            0.276940    
## maritalStatusDivorced/widowed                0.014320 *  
## industryManufacturing                        0.205631    
## industryEnergy and water supply              0.279138    
## industryConstruction                         0.358765    
## industryDistribution, hotels and restaurants 0.016869 *  
## industryTransport and communication          0.706136    
## industryBanking and finance                  0.016883 *  
## industryPublic admin, education and health   0.052886 .  
## industryOther services                       0.051623 .  
## abs(age - 40)                                4.51e-08 ***
## regionWales                                  0.037653 *  
## regionScotland                               0.159346    
## regionNorthern Ireland                       0.733144    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.52 on 3557 degrees of freedom
## Multiple R-squared:  0.2006, Adjusted R-squared:  0.1945 
## F-statistic: 33.06 on 27 and 3557 DF,  p-value: < 2.2e-16

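Compare their adjusted R-squared: 18.78% for the original model and 19.45% for the refined model (multiple R-squared: 19.39% and 20.06%). The refined model explains more of the variation in work hours.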

RESET test: detect structure in residuals

## 
##  RESET test
## 
## data:  .
## RESET = 11.087, df1 = 2, df2 = 3555, p-value = 1.585e-05
## 
##  RESET test
## 
## data:  .
## RESET = 8.3052, df1 = 2, df2 = 3555, p-value = 0.0002521

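The p-values are 1.585e-05 and 0.0002521 for the original and the refined model respectively. The null hypothesis is rejected in both cases, which means there is structure in the residuals and a strong indication that the model is misspecified in some way. But the refined model is better: its RESET statistic is smaller (8.31 versus 11.09).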
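The idea behind the RESET test can be reproduced with base R. The sketch below is an illustration on simulated data, not the report's code: it adds the squared and cubed fitted values to the model and tests them jointly with an F-test, which gives the two numerator degrees of freedom (df1 = 2) seen in the output above. The `lmtest` package provides a ready-made `resettest()` function; the manual version is used here only to keep the example self-contained.

```r
# Simulated data (hypothetical, not the survey data)
set.seed(5)
n <- 300
toy <- data.frame(x = runif(n, 0, 10))
# True relationship is quadratic, so a linear model is misspecified
toy$y <- 2 + 0.5 * toy$x^2 + rnorm(n, sd = 2)

fit_lin <- lm(y ~ x, data = toy)

# Augment the model with powers of the fitted values
toy$fit2 <- fitted(fit_lin)^2
toy$fit3 <- fitted(fit_lin)^3
fit_aug <- lm(y ~ x + fit2 + fit3, data = toy)

# F-test of H0: the coefficients on fit2 and fit3 are both zero
reset <- anova(fit_lin, fit_aug)
print(reset)

reset_df1 <- reset$Df[2]
reset_p <- reset[["Pr(>F)"]][2]
```

A small p-value indicates that powers of the fitted values still explain the response, so some non-linearity or omitted term remains in the model.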

Residuals versus fitted plot: find non-linear patterns and problems

The red line is close to the dotted line and only slightly curved, so there is no clear non-linear pattern. For some people with long work hours the model substantially underestimates their work hours (outliers). The points cluster more tightly around the zero line for low fitted values, so the model performs better at shorter work hours. The points are more spread out for higher fitted values, so the estimates vary more at longer work hours, and overprediction in particular becomes more serious.

5.2 Assumption2: The residuals have the same variance and are independent from one another

5.2.1 Checking heteroscedasticity

The points in the residuals versus fitted plot are not evenly spread but look slightly like a funnel, so there may be heteroscedasticity. One potential reason is the misspecification indicated above. The other is pure heteroscedasticity. According to the statistical description in Section 3, the candidate variables in the model are gender, education, industry and major group. Besides, people in the same household may share some characteristics, such as region, which may also cause pure heteroscedasticity. Now use the Breusch-Pagan test (do the independent variables help explain the variance of the residuals?) to check heteroscedasticity quantitatively.

## 
##  studentized Breusch-Pagan test
## 
## data:  .
## BP = 36.362, df = 27, p-value = 0.1076

The p-value is 0.1076, which is greater than 0.05. The null hypothesis (the variance of the residuals is constant) cannot be rejected, so there is no strong indication that the model exhibits heteroscedasticity (including pure heteroscedasticity). So there is no need to present robust standard errors to see which variables should be excluded.
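The studentized Breusch-Pagan statistic can be computed by hand, which makes its logic explicit. The sketch below uses simulated data and is not the report's code: the squared residuals are regressed on the model's regressors, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution whose degrees of freedom equal the number of regressors (27 in the report's model, 1 in this toy example).

```r
# Simulated data (hypothetical, not the survey data)
set.seed(6)
n <- 500
toy <- data.frame(x = runif(n, 1, 10))
# Error standard deviation grows with x, so the errors are heteroscedastic
toy$y <- 3 + 2 * toy$x + rnorm(n, sd = 0.5 * toy$x)

fit <- lm(y ~ x, data = toy)

# Auxiliary regression of squared residuals on the regressors
toy$sq_resid <- residuals(fit)^2
aux <- lm(sq_resid ~ x, data = toy)

# Studentized (Koenker) statistic: n * R^2, chi-squared with k df
k <- length(coef(aux)) - 1
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = k, lower.tail = FALSE)

print(c(BP = bp_stat, df = k, p_value = bp_p))
```

In this simulation the test rejects constant variance because the spread was built to grow with `x`; in the report the same statistic does not reach significance.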

5.2.2 Checking distribution of residuals

Use a QQ-plot to check, as a matter of degree, whether the residuals follow a normal distribution.

At the lower end of the distribution the points are close to the dotted line; at the higher end they are far away from it. The residuals do not follow a normal distribution.
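The pattern described for the QQ-plot, a good fit at the lower end and large departures at the upper end, can be quantified by comparing sample quantiles with the reference line. The sketch below uses a simulated stand-in for the residuals with a stretched upper tail; the vector `res` and the choice of the ten most extreme points at each end are assumptions for illustration.

```r
# Simulated stand-in for the residuals (not the report's residuals)
set.seed(7)
z <- rnorm(500)
# Stretch the upper tail only
res <- ifelse(z > 1, 1 + 3 * (z - 1), z)

# qqnorm() returns theoretical (x) and sample (y) quantiles
qq <- qqnorm(res, plot.it = FALSE)
ord <- order(qq$x)
theoretical <- qq$x[ord]
sample_q <- qq$y[ord]

# Reference line through the first and third quartiles, as qqline() draws it
q_sample <- quantile(res, c(0.25, 0.75))
q_theory <- qnorm(c(0.25, 0.75))
slope <- unname(diff(q_sample) / diff(q_theory))
intercept <- unname(q_sample[1] - slope * q_theory[1])

# Average distance from the line at each end of the distribution
deviation <- sample_q - (intercept + slope * theoretical)
lower_dev <- mean(abs(head(deviation, 10)))
upper_dev <- mean(abs(tail(deviation, 10)))
print(c(lower = lower_dev, upper = upper_dev))

# qqnorm(res); qqline(res) would draw the plot itself
```

A much larger deviation at the upper end than at the lower end corresponds to a heavy right tail, that is, to observations whose work hours are far above what the model predicts.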

Now use a logarithmic transformation of the dependent variable to try to solve the issues above (the dependent variable has no zeros). This gives a semi-log regression:

## 
## Call:
## lm(formula = log(workHour) ~ gender + education + majorGroup + 
##     maritalStatus + industry + abs(age - 40) + region, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3532 -0.1452  0.0764  0.2761  1.4754 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                   4.060866   0.126113  32.200
## genderFemale                                 -0.242990   0.018201 -13.350
## educationHigher education                    -0.066720   0.024942  -2.675
## educationA level or equivalent               -0.024293   0.024710  -0.983
## educationSecondary                           -0.041191   0.025163  -1.637
## educationOther                               -0.078683   0.045970  -1.712
## majorGroupPProfessionals                     -0.008006   0.034205  -0.234
## majorGroupAssoc. professionals               -0.075134   0.035283  -2.129
## majorGroupAdministrative                     -0.164619   0.036789  -4.475
## majorGroupSkilled Trade                      -0.065966   0.040751  -1.619
## majorGroupCaring & Leisure                   -0.171745   0.040980  -4.191
## majorGroupSales & cust services              -0.237791   0.041339  -5.752
## majorGroupMachine Operatives                 -0.151578   0.044276  -3.424
## majorGroupElementary occupations             -0.428333   0.037870 -11.311
## maritalStatusMarried/cohabitating             0.019999   0.020012   0.999
## maritalStatusDivorced/widowed                 0.083952   0.031440   2.670
## industryManufacturing                        -0.138367   0.122074  -1.133
## industryEnergy and water supply              -0.110159   0.132737  -0.830
## industryConstruction                         -0.137142   0.124879  -1.098
## industryDistribution, hotels and restaurants -0.284044   0.121348  -2.341
## industryTransport and communication          -0.042140   0.125486  -0.336
## industryBanking and finance                  -0.296229   0.125542  -2.360
## industryPublic admin, education and health   -0.224410   0.121491  -1.847
## industryOther services                       -0.243275   0.121677  -1.999
## abs(age - 40)                                -0.007760   0.001321  -5.876
## regionWales                                  -0.046830   0.037561  -1.247
## regionScotland                               -0.032542   0.030999  -1.050
## regionNorthern Ireland                       -0.003518   0.048297  -0.073
##                                              Pr(>|t|)    
## (Intercept)                                   < 2e-16 ***
## genderFemale                                  < 2e-16 ***
## educationHigher education                    0.007507 ** 
## educationA level or equivalent               0.325625    
## educationSecondary                           0.101725    
## educationOther                               0.087057 .  
## majorGroupPProfessionals                     0.814940    
## majorGroupAssoc. professionals               0.033283 *  
## majorGroupAdministrative                     7.89e-06 ***
## majorGroupSkilled Trade                      0.105586    
## majorGroupCaring & Leisure                   2.85e-05 ***
## majorGroupSales & cust services              9.55e-09 ***
## majorGroupMachine Operatives                 0.000625 ***
## majorGroupElementary occupations              < 2e-16 ***
## maritalStatusMarried/cohabitating            0.317693    
## maritalStatusDivorced/widowed                0.007615 ** 
## industryManufacturing                        0.257093    
## industryEnergy and water supply              0.406650    
## industryConstruction                         0.272190    
## industryDistribution, hotels and restaurants 0.019300 *  
## industryTransport and communication          0.737034    
## industryBanking and finance                  0.018349 *  
## industryPublic admin, education and health   0.064810 .  
## industryOther services                       0.045646 *  
## abs(age - 40)                                4.58e-09 ***
## regionWales                                  0.212558    
## regionScotland                               0.293892    
## regionNorthern Ireland                       0.941935    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.477 on 3557 degrees of freedom
## Multiple R-squared:  0.1835, Adjusted R-squared:  0.1773 
## F-statistic: 29.61 on 27 and 3557 DF,  p-value: < 2.2e-16

RESET test for the semi-log regression

## 
##  RESET test
## 
## data:  .
## RESET = 29.012, df1 = 2, df2 = 3555, p-value = 3.178e-13

Residuals versus fitted plot of the semi-log regression

BP-test of the semi-log regression

## 
##  studentized Breusch-Pagan test
## 
## data:  .
## BP = 108.79, df = 27, p-value = 8.869e-12

QQ-plot of the semi-log regression

All tests show that the semi-log regression is less robust than the original model.
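To make the RESET comparison concrete, here is a minimal base-R sketch of what the statistic measures. It assumes the default form of the test (squared and cubed fitted values added to the regression, consistent with `df1 = 2` in the output above) and runs on simulated data: the data frame `sim`, its columns and the helper `reset_by_hand()` are hypothetical and not part of this report's code.

```r
# Simulated stand-in data (hypothetical; NOT the survey data of this report).
set.seed(1)
n <- 500
sim <- data.frame(
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE),
                  levels = c("Male", "Female")),
  age = sample(18:65, n, replace = TRUE)
)
sim$workHour <- pmax(1, 40 - 7 * (sim$gender == "Female") -
                          0.2 * abs(sim$age - 40) + rnorm(n, sd = 5))

# RESET by hand: refit with fitted^2 and fitted^3 added, then F-test the
# two extra terms. Assumes `data` has no missing values.
reset_by_hand <- function(formula, data) {
  base <- lm(formula, data = data)
  data$fit_sq <- fitted(base)^2
  data$fit_cu <- fitted(base)^3
  augmented <- lm(update(formula, . ~ . + fit_sq + fit_cu), data = data)
  cmp <- anova(base, augmented)
  c(RESET = cmp$F[2], df1 = cmp$Df[2], df2 = cmp$Res.Df[2],
    p.value = cmp[["Pr(>F)"]][2])
}

# Compare the level model with the semi-log model on the same data.
level_reset <- reset_by_hand(workHour ~ gender + abs(age - 40), sim)
log_reset   <- reset_by_hand(log(workHour) ~ gender + abs(age - 40), sim)
print(rbind(level = level_reset, semi_log = log_reset))
```

A small p-value suggests that the functional form is misspecified, so comparing the two rows mirrors the comparison between the original model and the semi-log model.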

5.2.3 Checking whether residuals are independent from one another

Knowing the residual of one observation should not help to predict the residual of another observation. However, the dataset was collected by household, and observations from the same household are likely to have similar characteristics, which may violate this assumption. The potential problem with a lack of independence in the residuals is that the standard errors are calculated incorrectly. So use cluster-robust standard errors, with residuals clustered within households: residuals within a household are assumed to be related, while residuals from different households are assumed to be unrelated.

## R^2= 0.20059 
## 
##                                                 Estimate Std. Error     t value
## (Intercept)                                   53.1152827 3.96280343  13.4034614
## genderFemale                                  -6.9788458 0.47724841 -14.6230884
## educationHigher education                     -1.7466701 0.67989396  -2.5690331
## educationA level or equivalent                -0.8279910 0.65133077  -1.2712297
## educationSecondary                            -1.2978724 0.67119195  -1.9336829
## educationOther                                -2.4770955 1.24958096  -1.9823410
## majorGroupPProfessionals                      -1.5787813 0.92472328  -1.7073014
## majorGroupAssoc. professionals                -3.2338927 0.97704296  -3.3098777
## majorGroupAdministrative                      -6.4364736 0.98446932  -6.5380134
## majorGroupSkilled Trade                       -2.8050981 1.10913239  -2.5290922
## majorGroupCaring & Leisure                    -5.9734719 1.16825910  -5.1131396
## majorGroupSales & cust services               -8.3471257 1.14891183  -7.2652448
## majorGroupMachine Operatives                  -5.0346585 1.20587043  -4.1751239
## majorGroupElementary occupations             -11.5463481 1.10003660 -10.4963309
## maritalStatusMarried/cohabitating              0.5713757 0.51178317   1.1164410
## maritalStatusDivorced/widowed                  2.0228145 0.80444726   2.5145396
## regionWales                                   -2.0507697 0.82603426  -2.4826691
## regionScotland                                -1.1456882 0.82980704  -1.3806683
## regionNorthern Ireland                        -0.4324014 1.06918193  -0.4044227
## industryManufacturing                         -4.0575805 3.88056170  -1.0456168
## industryEnergy and water supply               -3.7725575 4.04412849  -0.9328481
## industryConstruction                          -3.0095466 3.93767722  -0.7642949
## industryDistribution, hotels and restaurants  -7.6172758 3.88008736  -1.9631712
## industryTransport and communication           -1.2424389 3.94891150  -0.3146282
## industryBanking and finance                   -7.8795268 3.96050272  -1.9895269
## industryPublic admin, education and health    -6.1773740 3.89543880  -1.5857967
## industryOther services                        -6.2201407 3.89623797  -1.5964478
## abs(age - 40)                                 -0.1900626 0.03530453  -5.3835192
##                                                  Pr(>|t|)
## (Intercept)                                  5.770769e-41
## genderFemale                                 2.001128e-48
## educationHigher education                    1.019827e-02
## educationA level or equivalent               2.036470e-01
## educationSecondary                           5.315212e-02
## educationOther                               4.744109e-02
## majorGroupPProfessionals                     8.776605e-02
## majorGroupAssoc. professionals               9.333675e-04
## majorGroupAdministrative                     6.234134e-11
## majorGroupSkilled Trade                      1.143580e-02
## majorGroupCaring & Leisure                   3.168479e-07
## majorGroupSales & cust services              3.723632e-13
## majorGroupMachine Operatives                 2.978239e-05
## majorGroupElementary occupations             8.980315e-26
## maritalStatusMarried/cohabitating            2.642334e-01
## maritalStatusDivorced/widowed                1.191879e-02
## regionWales                                  1.304022e-02
## regionScotland                               1.673810e-01
## regionNorthern Ireland                       6.859019e-01
## industryManufacturing                        2.957380e-01
## industryEnergy and water supply              3.508984e-01
## industryConstruction                         4.446915e-01
## industryDistribution, hotels and restaurants 4.962628e-02
## industryTransport and communication          7.530440e-01
## industryBanking and finance                  4.664307e-02
## industryPublic admin, education and health   1.127854e-01
## industryOther services                       1.103888e-01
## abs(age - 40)                                7.304344e-08

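There is not much difference from the original standard errors; for the key variable, gender, the standard error is almost unchanged (0.4772 clustered versus 0.4779 originally). So there is no sign that the standard errors are calculated incorrectly or that the model lacks independence in its residuals.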
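The clustering idea can be sketched in base R as a sandwich estimator. This is a minimal illustration on simulated data, assuming the simplest form of the estimator with no small-sample correction, so packaged routines that apply corrections may give slightly different numbers. The data frame `sim`, the `household` column and the helper `cluster_se()` are hypothetical.

```r
# Simulated stand-in data (hypothetical): two people per household who share
# a common household effect, so their residuals are correlated.
set.seed(2)
n_house <- 200
sim <- data.frame(household = rep(seq_len(n_house), each = 2))
sim$gender <- factor(rep(c("Male", "Female"), times = n_house),
                     levels = c("Male", "Female"))
house_effect <- rnorm(n_house, sd = 4)
sim$workHour <- 40 - 7 * (sim$gender == "Female") +
  house_effect[sim$household] + rnorm(nrow(sim), sd = 5)

fit <- lm(workHour ~ gender, data = sim)

# Cluster-robust standard errors without small-sample correction:
# (X'X)^-1 [ sum_g (X_g' e_g)(X_g' e_g)' ] (X'X)^-1
cluster_se <- function(model, cluster) {
  X <- model.matrix(model)
  e <- residuals(model)
  bread <- solve(crossprod(X))
  scores <- rowsum(X * e, group = cluster)  # X_g' e_g, one row per cluster
  meat <- crossprod(scores)
  sqrt(diag(bread %*% meat %*% bread))
}

comparison <- cbind(classical = sqrt(diag(vcov(fit))),
                    clustered = cluster_se(fit, sim$household))
print(comparison)
```

If the two columns are close, as in the table above, clustering has little effect on inference.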

5.3 Assumption 3: There is no linear relationship between independent variables

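Use the Variance Inflation Factor (VIF) to check how much the variance of each coefficient estimate is inflated by multicollinearity. Because most predictors are categorical with more than one coefficient, the output reports the generalized VIF (GVIF); GVIF^(1/(2*Df)) puts terms with different degrees of freedom on a comparable scale.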

##                   GVIF Df GVIF^(1/(2*Df))
## gender        1.303093  1        1.141531
## education     1.633899  4        1.063293
## majorGroup    4.054445  8        1.091430
## maritalStatus 1.179513  2        1.042139
## industry      2.845951  8        1.067552
## region        1.033220  3        1.005462
## abs(age - 40) 1.113772  1        1.055354

All GVIF^(1/(2*Df)) values are close to 1 (the largest is 1.14, for gender), so there is no indication that multicollinearity is a problem.
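For reference, the GVIF columns can be reproduced with a short base-R sketch. It assumes the determinant form of the generalized VIF, det(R11) * det(R22) / det(R), where R is the correlation matrix of the design columns, R11 covers the columns of one term and R22 the remaining columns. The simulated data frame `sim` and the helper `gvif_by_hand()` are hypothetical.

```r
# Simulated stand-in data (hypothetical; NOT the survey data of this report).
set.seed(3)
n <- 400
sim <- data.frame(
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE),
                  levels = c("Male", "Female")),
  education = factor(sample(c("Degree", "A level", "Secondary"), n,
                            replace = TRUE)),
  age = sample(18:65, n, replace = TRUE)
)
sim$workHour <- 40 - 7 * (sim$gender == "Female") -
  0.2 * abs(sim$age - 40) + rnorm(n, sd = 5)

fit <- lm(workHour ~ gender + education + abs(age - 40), data = sim)

# Generalized VIF for each model term, from the design-matrix correlations.
gvif_by_hand <- function(model) {
  X <- model.matrix(model)
  term_id <- attr(X, "assign")
  keep <- term_id > 0                       # drop the intercept column
  R <- cor(X[, keep, drop = FALSE])
  term_id <- term_id[keep]
  labels <- attr(terms(model), "term.labels")
  out <- t(sapply(seq_along(labels), function(k) {
    idx <- which(term_id == k)
    gvif <- det(R[idx, idx, drop = FALSE]) *
      det(R[-idx, -idx, drop = FALSE]) / det(R)
    c(GVIF = gvif, Df = length(idx),
      "GVIF^(1/(2*Df))" = gvif^(1 / (2 * length(idx))))
  }))
  rownames(out) <- labels
  out
}

gvif_table <- gvif_by_hand(fit)
print(gvif_table)
```

For a term with a single coefficient, GVIF reduces to the ordinary VIF, 1 / (1 - R^2), where R^2 comes from regressing that column on all the other predictors.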

5.4 Assumption4: The independent variables are fixed and measured without error

No independent variable is clearly determined by work hours, so this research does not consider endogenous relationships.

Continue to check the independent variables to try to refine the model. The region variable was included because work hours differ across countries in the world[], so they were expected to differ across regions of the UK as well (although there is no evidence for this). But only one category of the region variable (Wales) is statistically significant at the 5% level. Now try to exclude the variable.

## 
## Call:
## lm(formula = workHour ~ gender + education + majorGroup + maritalStatus + 
##     industry + abs(age - 40), data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.365  -6.976   0.141   6.961  73.576 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                   53.12133    3.31202  16.039
## genderFemale                                  -6.96560    0.47801 -14.572
## educationHigher education                     -1.77433    0.65499  -2.709
## educationA level or equivalent                -0.78481    0.64858  -1.210
## educationSecondary                            -1.35154    0.66059  -2.046
## educationOther                                -2.58551    1.20632  -2.143
## majorGroupPProfessionals                      -1.61950    0.89805  -1.803
## majorGroupAssoc. professionals                -3.25657    0.92661  -3.515
## majorGroupAdministrative                      -6.51377    0.96519  -6.749
## majorGroupSkilled Trade                       -2.87298    1.06879  -2.688
## majorGroupCaring & Leisure                    -6.04408    1.07570  -5.619
## majorGroupSales & cust services               -8.39795    1.08514  -7.739
## majorGroupMachine Operatives                  -5.10870    1.16154  -4.398
## majorGroupElementary occupations             -11.53667    0.99416 -11.604
## maritalStatusMarried/cohabitating              0.53659    0.52533   1.021
## maritalStatusDivorced/widowed                  1.99006    0.82540   2.411
## industryManufacturing                         -4.13534    3.20595  -1.290
## industryEnergy and water supply               -3.79784    3.48630  -1.089
## industryConstruction                          -3.09278    3.28005  -0.943
## industryDistribution, hotels and restaurants  -7.75010    3.18669  -2.432
## industryTransport and communication           -1.25565    3.29551  -0.381
## industryBanking and finance                   -7.96560    3.29753  -2.416
## industryPublic admin, education and health    -6.32009    3.19056  -1.981
## industryOther services                        -6.32833    3.19564  -1.980
## abs(age - 40)                                 -0.19142    0.03467  -5.522
##                                              Pr(>|t|)    
## (Intercept)                                   < 2e-16 ***
## genderFemale                                  < 2e-16 ***
## educationHigher education                    0.006782 ** 
## educationA level or equivalent               0.226339    
## educationSecondary                           0.040833 *  
## educationOther                               0.032156 *  
## majorGroupPProfessionals                     0.071416 .  
## majorGroupAssoc. professionals               0.000446 ***
## majorGroupAdministrative                     1.73e-11 ***
## majorGroupSkilled Trade                      0.007220 ** 
## majorGroupCaring & Leisure                   2.07e-08 ***
## majorGroupSales & cust services              1.30e-14 ***
## majorGroupMachine Operatives                 1.12e-05 ***
## majorGroupElementary occupations              < 2e-16 ***
## maritalStatusMarried/cohabitating            0.307121    
## maritalStatusDivorced/widowed                0.015958 *  
## industryManufacturing                        0.197171    
## industryEnergy and water supply              0.276068    
## industryConstruction                         0.345792    
## industryDistribution, hotels and restaurants 0.015064 *  
## industryTransport and communication          0.703212    
## industryBanking and finance                  0.015758 *  
## industryPublic admin, education and health   0.047683 *  
## industryOther services                       0.047747 *  
## abs(age - 40)                                3.59e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.53 on 3560 degrees of freedom
## Multiple R-squared:  0.1992, Adjusted R-squared:  0.1939 
## F-statistic: 36.91 on 24 and 3560 DF,  p-value: < 2.2e-16

RESET test for the model without region

## 
##  RESET test
## 
## data:  .
## RESET = 9.0177, df1 = 2, df2 = 3558, p-value = 0.000124

Residuals versus fitted plot of the model without region

BP-test of the model without region

## 
##  studentized Breusch-Pagan test
## 
## data:  .
## BP = 29.77, df = 24, p-value = 0.1925

QQ-plot of the model without region

The tests show that the new model is less robust than the original model. So including the region variable makes sense.
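The Breusch-Pagan statistic used in these comparisons can also be sketched in base R. The outputs in this report are labelled "studentized Breusch-Pagan test"; the sketch assumes that version, in which the statistic is n times the R-squared of an auxiliary regression of the squared residuals on the model's regressors. The simulated data frame `sim` (built to be heteroskedastic on purpose) and the helper `bp_by_hand()` are hypothetical.

```r
# Simulated stand-in data (hypothetical): the error spread grows with the
# distance of age from 40, so the residual variance is NOT constant.
set.seed(4)
n <- 500
sim <- data.frame(
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE),
                  levels = c("Male", "Female")),
  age = sample(18:65, n, replace = TRUE)
)
spread <- 2 + 0.3 * abs(sim$age - 40)
sim$workHour <- 40 - 7 * (sim$gender == "Female") -
  0.2 * abs(sim$age - 40) + rnorm(n, sd = spread)

fit <- lm(workHour ~ gender + abs(age - 40), data = sim)

# Studentized Breusch-Pagan statistic: n * R^2 from regressing the squared
# residuals on the original design matrix.
bp_by_hand <- function(model) {
  X <- model.matrix(model)
  e2 <- residuals(model)^2
  aux <- lm.fit(X, e2)
  r2 <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  stat <- length(e2) * r2
  df <- ncol(X) - 1
  c(BP = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

bp <- bp_by_hand(fit)
print(bp)
```

A small p-value is evidence against constant residual variance, while a larger p-value, as in the output above, gives no such evidence.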

One category of the marital status variable is not statistically significant. But previous research shows that marital status can influence work hours[], so its categories will not be combined. Some categories of the education variable are not statistically significant. But the classification is ordered, so its categories cannot be combined arbitrarily. Some categories of the industry variable are not statistically significant either, and this classification has no order. Now try to reclassify it. Since the first four categories rely more on physical labor, try to combine them to increase efficiency (“Agriculture, forestry and fishing”, “Manufacturing”, “Energy and water supply”, “Construction”).

# Requires dplyr, forcats and lmtest.
# Collapse the four physical-labor industries into one level, refit, and run the RESET test.
list %>% mutate(industry = fct_collapse(industry, "Agriculture/Forestry and fishing/Manufacturing/Energy and water supply/Construction" =
                                          c("Agriculture, forestry and fishing", "Manufacturing", "Energy and water supply", "Construction"))) %>%
  lm(workHour ~ gender + education + majorGroup + maritalStatus + region + industry + abs(age - 40), data = .) %>%
  resettest()
## 
##  RESET test
## 
## data:  .
## RESET = 8.3813, df1 = 2, df2 = 3558, p-value = 0.0002337
# Same collapsed-industry model as above, now checked with the Breusch-Pagan test.
list %>% mutate(industry = fct_collapse(industry, "Agriculture/Forestry and fishing/Manufacturing/Energy and water supply/Construction" =
                                          c("Agriculture, forestry and fishing", "Manufacturing", "Energy and water supply", "Construction"))) %>%
  lm(workHour ~ gender + education + majorGroup + maritalStatus + region + industry + abs(age - 40), data = .) %>%
  bptest()
## 
##  studentized Breusch-Pagan test
## 
## data:  .
## BP = 32.919, df = 24, p-value = 0.1058

The tests show that the new model is less robust than the original model. So keep the original model. The original model is:

## 
## Call:
## lm(formula = workHour ~ gender + education + majorGroup + maritalStatus + 
##     region + industry + abs(age - 40), data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.502  -6.967   0.055   7.066  73.419 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                   53.11528    3.31134  16.040
## genderFemale                                  -6.97885    0.47792 -14.603
## educationHigher education                     -1.74667    0.65490  -2.667
## educationA level or equivalent                -0.82799    0.64882  -1.276
## educationSecondary                            -1.29787    0.66070  -1.964
## educationOther                                -2.47710    1.20704  -2.052
## majorGroupPProfessionals                      -1.57878    0.89811  -1.758
## majorGroupAssoc. professionals                -3.23389    0.92643  -3.491
## majorGroupAdministrative                      -6.43647    0.96598  -6.663
## majorGroupSkilled Trade                       -2.80510    1.06999  -2.622
## majorGroupCaring & Leisure                    -5.97347    1.07601  -5.551
## majorGroupSales & cust services               -8.34713    1.08543  -7.690
## majorGroupMachine Operatives                  -5.03466    1.16255  -4.331
## majorGroupElementary occupations             -11.54635    0.99435 -11.612
## maritalStatusMarried/cohabitating              0.57138    0.52546   1.087
## maritalStatusDivorced/widowed                  2.02281    0.82553   2.450
## regionWales                                   -2.05077    0.98624  -2.079
## regionScotland                                -1.14569    0.81395  -1.408
## regionNorthern Ireland                        -0.43240    1.26814  -0.341
## industryManufacturing                         -4.05758    3.20529  -1.266
## industryEnergy and water supply               -3.77256    3.48529  -1.082
## industryConstruction                          -3.00955    3.27894  -0.918
## industryDistribution, hotels and restaurants  -7.61728    3.18623  -2.391
## industryTransport and communication           -1.24244    3.29489  -0.377
## industryBanking and finance                   -7.87953    3.29637  -2.390
## industryPublic admin, education and health    -6.17737    3.18998  -1.936
## industryOther services                        -6.22014    3.19487  -1.947
## abs(age - 40)                                 -0.19006    0.03467  -5.482
##                                              Pr(>|t|)    
## (Intercept)                                   < 2e-16 ***
## genderFemale                                  < 2e-16 ***
## educationHigher education                    0.007686 ** 
## educationA level or equivalent               0.201986    
## educationSecondary                           0.049564 *  
## educationOther                               0.040223 *  
## majorGroupPProfessionals                     0.078852 .  
## majorGroupAssoc. professionals               0.000488 ***
## majorGroupAdministrative                     3.09e-11 ***
## majorGroupSkilled Trade                      0.008789 ** 
## majorGroupCaring & Leisure                   3.04e-08 ***
## majorGroupSales & cust services              1.89e-14 ***
## majorGroupMachine Operatives                 1.53e-05 ***
## majorGroupElementary occupations              < 2e-16 ***
## maritalStatusMarried/cohabitating            0.276940    
## maritalStatusDivorced/widowed                0.014320 *  
## regionWales                                  0.037653 *  
## regionScotland                               0.159346    
## regionNorthern Ireland                       0.733144    
## industryManufacturing                        0.205631    
## industryEnergy and water supply              0.279138    
## industryConstruction                         0.358765    
## industryDistribution, hotels and restaurants 0.016869 *  
## industryTransport and communication          0.706136    
## industryBanking and finance                  0.016883 *  
## industryPublic admin, education and health   0.052886 .  
## industryOther services                       0.051623 .  
## abs(age - 40)                                4.51e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.52 on 3557 degrees of freedom
## Multiple R-squared:  0.2006, Adjusted R-squared:  0.1945 
## F-statistic: 33.06 on 27 and 3557 DF,  p-value: < 2.2e-16

The final model shows that women on average work about 7 hours fewer per week than men (coefficient -6.98), after accounting for education level, major group, marital status, region, industry and age. These factors explain about 20% of the variation in work hours (R-squared 0.2006); the rest may be explained by other things, such as time occupied by other activities, health status and so on. Although it has some limitations, the model is relatively robust. The model is also stable: changing the standard error calculation, combining categories of a variable, and excluding a variable did not change the results much.
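As a worked example of how precisely the gender gap is estimated, the following sketch computes a 95% confidence interval for the `genderFemale` coefficient. The estimate, standard error and residual degrees of freedom are copied from the final model output above; the variable names are illustrative, and the interval is the usual t-based one.

```r
# Values copied from the final model summary above.
estimate  <- -6.97885   # genderFemale coefficient (hours per week)
std_error <- 0.47792    # its standard error
resid_df  <- 3557       # residual degrees of freedom

# Two-sided 95% confidence interval based on the t distribution.
t_crit <- qt(0.975, df = resid_df)
ci <- estimate + c(-1, 1) * t_crit * std_error
print(round(ci, 2))
```

The interval runs from roughly 7.9 to 6.0 hours fewer per week for women, so the rounded figure of 7 hours sits well inside it.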

[1] https://onlinelibrary.wiley.com/doi/abs/10.1111/gwao.12506

[2] https://www.ons.gov.uk/economy/nationalaccounts/satelliteaccounts/articles/leisuretimeintheuk/2015

[3] http://ftp.iza.org/dp10454.pdf

[4] https://www.theguardian.com/uk-news/2020/jul/28/total-working-hours-now-near-equal-for-uk-men-and-women